In the era of big data and technological advancement, the role of data science has become integral to driving innovation, efficiency, and strategic decision-making across diverse industries. As organizations increasingly rely on data-driven insights, the demand for skilled data scientists has surged, leading to a competitive job market where salary structures play a crucial role. This report undertakes a comprehensive exploration into the factors influencing data science salaries, aiming to uncover patterns, trends, and geographical variations that define compensation packages in this dynamic field.
The motivation behind this in-depth analysis lies in addressing the growing curiosity and necessity surrounding data science salaries. For aspiring data scientists, understanding the key determinants of compensation is essential in shaping career trajectories and making informed choices regarding skill development. Simultaneously, employers and industry stakeholders seek insights into the factors that attract and retain top-tier data science talent in order to remain competitive and innovative.
The field of data science is not static; it evolves with technological advancements, industry demands, and methodological innovations. Consequently, the motivation for this report is to offer a nuanced perspective on the salary landscape, moving beyond a superficial examination to delve into the specific factors that contribute to earning differentials within the profession. *Additionally, a particular emphasis will be placed on exploring which states and cities within the United States, as well as countries globally, offer the highest-paying data science roles*. By doing so, this report aims to provide a comprehensive understanding of the regional dynamics shaping data science salaries, offering valuable insights for both professionals and employers navigating the dynamic landscape of data science compensation.
Building on a preliminary attempt to predict salaries from job descriptions, this report goes a step further by examining how various factors influence compensation within the dynamic realm of data science. This initial exploration sets the stage for a more comprehensive understanding of salary determinants and contributes valuable insights for both professionals and employers navigating the intricate landscape of data science compensation.
Data URL: https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor/
Data Type: CSV format
Dataset size: 741 rows
Dataset Year: 2017
About: This dataset encapsulates job postings sourced from Glassdoor.com, spanning the period of 2017-2018 and focusing on positions within the United States. The dataset is enriched with a diverse set of features providing comprehensive insights into each job listing.
Key attributes include:
Job Title: The specific designation or role associated with the job posting.
Salary Estimate: An indication of the expected salary for the corresponding position.
Job Description: A detailed overview of the responsibilities and requirements associated with the job.
Rating: The rating assigned to the company, reflecting its overall reputation.
Company Name: The name of the hiring company.
Location: The geographical location of the job.
Headquarters: The location of the company's headquarters.
Size: The size of the company in terms of the number of employees.
Founded: The year the company was established.
Type of Ownership: The ownership structure of the company.
Industry: The industry to which the company belongs.
Sector: The sector in which the company operates.
Revenue: Information about the company's revenue.
Competitors: Identifies competitors in the industry.
Hourly: Indicates if the salary is offered on an hourly basis.
Employer Provided: Indicates whether the salary figure was supplied by the employer rather than estimated by Glassdoor.
Min Salary, Max Salary, Avg Salary: Different aspects of the salary information.
Company Text: A textual representation of the company's name.
Job State: The state in which the job is located.
Same State: Indicates if the job is in the same state as the company's headquarters.
Age: The age of the company, calculated from the founding year.
Python, R, Spark, AWS, Excel: Binary indicators showcasing the technologies or skills associated with the job.
This comprehensive dataset offers a wealth of information for analysis and exploration, making it valuable for understanding trends and patterns in the job market within the specified timeframe and region.
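The derived columns Min Salary, Max Salary, and Avg Salary presumably come from parsing the Salary Estimate string. A minimal sketch of how such parsing might look (the helper name and regex are my own illustration, not the dataset authors' actual cleaning code):

```python
import re

def parse_salary_estimate(estimate):
    """Parse a Glassdoor-style string like '$53K-$91K (Glassdoor est.)'
    into (min, max, avg), all in thousands of USD."""
    lo, hi = (int(n) for n in re.findall(r'\$(\d+)K', estimate)[:2])
    return lo, hi, (lo + hi) / 2

print(parse_salary_estimate('$53K-$91K (Glassdoor est.)'))  # (53, 91, 72.0)
```

Applied row by row, a helper like this would populate the three numeric salary columns from the raw estimate text.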
Data URL: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023
Data Type: CSV format
Dataset size: 3755 rows
Dataset Year: 2020-2023
The analysis was conducted utilizing a dataset that encompasses pertinent information regarding Data Scientists from 2020-2023. The dataset comprises the following variables: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, and company_size.
Begin by installing the necessary dependencies. It's crucial to note that a specific version of SQLAlchemy is required due to compatibility issues with pandas, as SQLAlchemy 2.0 is currently incompatible. Additionally, nbformat is indispensable for leveraging the "notebook" formatter, as other formatters may not be rendered correctly to HTML. Ensure the installation of these dependencies to facilitate seamless functionality and proper rendering within the notebook environment.
%pip install pandas
%pip install plotly
%pip install SQLAlchemy==1.4.46
%pip install nbformat
%pip install matplotlib
%pip install seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objects as go
import plotly.io as pio
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
import pickle
import matplotlib.ticker as ticker
clean_salary.sqlite and ds_salaries.sqlite are obtained from datasource 1 and datasource 2, respectively.
# Read the table clean_salary into a Pandas dataframe
df = pd.read_sql_table('clean_salary', 'sqlite:///../data/clean_salary.sqlite')
# Read the table ds_salaries into a Pandas dataframe
df2 = pd.read_sql_table('ds_salaries', 'sqlite:///../data/ds_salaries.sqlite')

Preview of the dataset:
df.head()
| | Job Title | Salary Estimate | Job Description | Rating | Location | Headquarters | Size | Founded | Type of ownership | Industry | ... | company_txt | job_state | same_state | age | python_yn | R_yn | spark | aws | excel | company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | $53K-$91K (Glassdoor est.) | Data Scientist\nLocation: Albuquerque, NM\nEdu... | 3.8 | Albuquerque, NM | Goleta, CA | 501 to 1000 employees | 1973 | Company - Private | Aerospace & Defense | ... | Tecolote Research\n | NM | 0 | 47 | 1 | 0 | 0 | 0 | 1 | Tecolote Research |
| 1 | Healthcare Data Scientist | $63K-$112K (Glassdoor est.) | What You Will Do:\n\nI. General Summary\n\nThe... | 3.4 | Linthicum, MD | Baltimore, MD | 10000+ employees | 1984 | Other Organization | Health Care Services & Hospitals | ... | University of Maryland Medical System\n | MD | 0 | 36 | 1 | 0 | 0 | 0 | 0 | University of Maryland Medical System |
| 2 | Data Scientist | $80K-$90K (Glassdoor est.) | KnowBe4, Inc. is a high growth information sec... | 4.8 | Clearwater, FL | Clearwater, FL | 501 to 1000 employees | 2010 | Company - Private | Security Services | ... | KnowBe4\n | FL | 1 | 10 | 1 | 0 | 1 | 0 | 1 | KnowBe4 |
| 3 | Data Scientist | $56K-$97K (Glassdoor est.) | *Organization and Job ID**\nJob ID: 310709\n\n... | 3.8 | Richland, WA | Richland, WA | 1001 to 5000 employees | 1965 | Government | Energy | ... | PNNL\n | WA | 1 | 55 | 1 | 0 | 0 | 0 | 0 | PNNL |
| 4 | Data Scientist | $86K-$143K (Glassdoor est.) | Data Scientist\nAffinity Solutions / Marketing... | 2.9 | New York, NY | New York, NY | 51 to 200 employees | 1998 | Company - Private | Advertising & Marketing | ... | Affinity Solutions\n | NY | 1 | 22 | 1 | 0 | 0 | 0 | 1 | Affinity Solutions |
5 rows × 28 columns
# Count the number of occurrences of each job title
job_title_counts = df['Job Title'].value_counts()
# Plot the top 10 job titles
plt.figure(figsize=(10, 6))
sns.barplot(x=job_title_counts.head(10), y=job_title_counts.head(10).index, orient='h')
plt.title('Top 10 Job Titles')
plt.xlabel('Count')
plt.ylabel('Job Title')
plt.show()
Inference 1: The analysis reveals a substantial prevalence of job listings for Data Scientist and Data Engineer roles, suggesting a robust demand for these professions in the United States. This observation implies a noteworthy abundance of opportunities in comparison to other data-related jobs, emphasizing the significance of these roles in the American job market.
ownership_counts = df['Type of ownership'].value_counts()
print("Ownership Types and Counts:")
print(ownership_counts)
Ownership Types and Counts:
Company - Private                 402
Company - Public                  193
Nonprofit Organization             54
Subsidiary or Business Segment     34
Government                         15
Hospital                           15
College / University               13
Other Organization                  3
School / School District            2
Unknown                             1
Name: Type of ownership, dtype: int64
Inference 2: The fact that most companies are privately owned suggests there are plenty of opportunities in the private sector. For entrepreneurs, it's worth considering ventures in data technology, which is booming and gaining importance in the American business scene.
top_industries = df['Industry'].value_counts().head(5)
print("Top 5 Industries with the Most Job Listings:")
print(top_industries)
Top 5 Industries with the Most Job Listings:
Biotech & Pharmaceuticals           112
Insurance Carriers                   63
Computer Hardware & Software         59
IT Services                          50
Health Care Services & Hospitals     49
Name: Industry, dtype: int64
# Filter the DataFrame for 'Biotech & Pharmaceuticals'
top_companies_bioPharma = df[df['Industry'] == 'Biotech & Pharmaceuticals']
# Display a pie chart showing the proportion of average salaries among the nine highest-paying companies
fig = ex.pie(top_companies_bioPharma.nlargest(9, 'avg_salary'), values='avg_salary', names='company', title='Proportion of Average Salaries in Top Biotech & Pharma Companies')
# Show the plot
fig.show()
Inference 3: Based on the analysis, the Biotech & Pharmaceuticals industry emerges as the sector with the highest number of job listings in data science. Notably, within this industry, companies such as Genentech and Pfizer stand out for offering comparatively higher salaries, reflecting a distinct salary trend among top players in the field.
revenue_distribution = df['Revenue'].value_counts()
print("Revenue Distribution:")
print(revenue_distribution)
Revenue Distribution:
Unknown / Non-Applicable            195
$10+ billion (USD)                  124
$100 to $500 million (USD)           91
$1 to $2 billion (USD)               60
$500 million to $1 billion (USD)     57
$50 to $100 million (USD)            46
$25 to $50 million (USD)             40
$2 to $5 billion (USD)               39
$10 to $25 million (USD)             32
$5 to $10 billion (USD)              19
$5 to $10 million (USD)              18
$1 to $5 million (USD)                8
Less than $1 million (USD)            3
Name: Revenue, dtype: int64
Inference 4: The analysis of revenue distribution showcases a diverse landscape. A considerable number of companies fall under the "Unknown / Non-Applicable" category, indicating a lack of available revenue information for these entities. Among companies with disclosed figures, a notable proportion belongs to the $10+ billion (USD) bracket, underscoring the presence of major corporations in the dataset. There is also significant representation in the mid-range, with many companies reporting revenues between $100 million and $2 billion (USD), highlighting a diverse mix of companies operating at various scales.
high_salary_100 = df['avg_salary'].sort_values(ascending=False).head(100)
# .copy() gives an independent DataFrame, avoiding SettingWithCopyWarning on assignment
high_salary_100_data = df.loc[high_salary_100.index].copy()
high_salary_100_data['salary_grade'] = 'high'
low_salary_100 = df['avg_salary'].sort_values(ascending=True).head(100)
low_salary_100_data = df.loc[low_salary_100.index].copy()
low_salary_100_data['salary_grade'] = 'low'
fig_high = ex.pie(high_salary_100_data, names='Job Title', title='High Salary Jobs', hole=0.3, width=1000, height=800)
fig_high.show()
fig_low = ex.pie(low_salary_100_data, names='Job Title', title='Low Salary Jobs', hole=0.3, width=1000, height=800)
fig_low.show()
Inference 5:
Based on the analysis, it is evident that within the data field, the highest salaries are earned by Data Scientists, constituting 16% of the roles, followed by Senior Data Scientists at 10%. Lead Data Scientists and Lead Data Engineers secure the next positions with 6% and 3%, respectively. The hierarchy indicates that Data Scientists generally command the highest salaries, trailed by Data Engineers and Data Managers in descending order.
The analysis reveals a disparity in salaries among Data Analyst roles, with Marketing Data Analysts and Senior Data Analysts earning considerably more than Research Scientists, Staff Scientists and Junior Analysts. This highlights distinct salary variations within the Data Analyst domain, emphasizing the impact of specialization and experience on compensation levels.
high_low_salary_200 = pd.concat([high_salary_100_data, low_salary_100_data])
# .copy() ensures the column selection is an independent DataFrame, avoiding SettingWithCopyWarning
high_low_salary_200_location = high_low_salary_200[['Location', 'min_salary', 'max_salary', 'avg_salary', 'salary_grade', 'Job Title', 'Industry']].copy()
v_split = high_low_salary_200_location.Location.str.split(', ')
high_low_salary_200_location['city'] = v_split.str.get(0)
high_low_salary_200_location['state'] = v_split.str.get(1)
high_salary_location = high_low_salary_200_location[high_low_salary_200_location['salary_grade'] == 'high']
low_salary_location = high_low_salary_200_location[high_low_salary_200_location['salary_grade'] == 'low']
fig = ex.pie(high_salary_location, values='avg_salary', names='city', title='Cities with the highest salaries in the US', hover_data=['avg_salary', 'max_salary'], labels={'avg_salary': 'Average Salary', 'max_salary': 'Maximum Salary'})
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
fig = ex.pie(low_salary_location, values='avg_salary', names='city', title='Cities with the lowest salaries in the US', hover_data=['avg_salary', 'max_salary'], labels={'avg_salary': 'Average Salary', 'max_salary': 'Maximum Salary'})
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
Inference 6: The analysis indicates that San Francisco accounts for approximately 21% of the top 100 highest-paying listings, showcasing its prominence in lucrative employment opportunities. Surprisingly, cities such as New York, Chicago, and Boston, while featuring prominently among the highest-paying cities (about 7% each), also appear among the lowest-paying ones, revealing regional disparities in salary distributions.
current_year = 2024 # Update with the current year
high_salary_100_data['company_age'] = current_year - high_salary_100_data['Founded']
# Explore technology requirements by age group (assuming you have columns like 'Python', 'R', 'Spark', 'AWS', 'Excel')
technology_columns = ['python_yn', 'R_yn', 'spark', 'aws', 'excel']
technology_by_age = high_salary_100_data.groupby('company_age')[technology_columns].mean().reset_index()
# Visualize technology requirements by age group
technology_by_age.plot(x='company_age', kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Company Age (Years)')
plt.ylabel('Percentage of Job Postings')
plt.title('Technology Requirements by Company Age')
plt.legend(title='Technology')
plt.show()
Inference 7: Python emerges as the predominant technology adopted universally across companies, spanning a wide age range from 7 to 173 years. Excel exhibits pervasive usage across companies irrespective of their age. While Spark and AWS find application in select companies, notably absent is the utilization of R language. Consequently, mastering Python and Excel is deemed essential for proficiency in the field.
#Plotting the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(high_salary_100_data['company_age'], high_salary_100_data['avg_salary'], color='b', alpha=0.7)
# Setting x-axis limits to show company ages up to 100 years
plt.xlim(0, 100)
# Adding labels and title
plt.xlabel('Company Age')
plt.ylabel('Average Salary')
plt.title('Scatter Plot: Company Age vs Average Salary')
# Display the plot
plt.grid(True)
plt.show()
Inference 8: The analysis indicates that companies in the field of data science, which pay high salaries, are predominantly those under 20 years of age. Organizations aged between 20 and 60 years offer moderate salary ranges. Notably, there is a discernible trend suggesting that newer companies or startups have a greater propensity to offer higher salaries compared to their older counterparts.
This code aims to create a machine learning model for predicting job salaries based on job descriptions. It uses a pipeline with CountVectorizer for text representation and Linear Regression as the predictive model. The model is trained on this dataset, and after evaluation, it is saved to a file ('LinearRegression.pickle'). The trained model is then used to make predictions on a test set, and finally, it demonstrates the capability to predict the salary for a new job description provided as 'new_job_description'.
# df already holds the cleaned Glassdoor data as a DataFrame; no conversion needed
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['Job Description'], df['avg_salary'], test_size=0.2, random_state=42
)
# Creating a simple pipeline with CountVectorizer and Linear Regression
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('regressor', LinearRegression())
])
# Training the model
model.fit(X_train, y_train)
# Save the trained model for reuse
with open('LinearRegression.pickle', 'wb') as f:
    pickle.dump(model, f)
# Making predictions on the test set
predictions = model.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
# This model can be used to predict the salary for a new job description
new_job_description = ['data scientist with expertise in machine learning and Python']
predicted_salary = model.predict(new_job_description)
print(f'Predicted Salary: {predicted_salary[0]}')
Mean Squared Error: 732.7657919299878
Predicted Salary: 88.92283975020152
# R2 score
model.score(X_test, y_test)
0.6294190906432551
y_pred = model.predict(X_test)
fig = ex.scatter(x=y_pred, y=y_test, labels={'x': 'Predicted Salary', 'y': 'Actual Salary'})
fig.show()
Inference: The call `model.score(X_test, y_test)` calculates the R-squared (R²) score for the trained model on the test data; here the score is 0.6294. The R² score, which is at most 1 (and can be negative when a model performs worse than simply predicting the mean), represents the proportion of the variance in the dependent variable (y_test) that is predictable from the independent variable (X_test). A score of 0.6294 means approximately 62.94% of the variability in actual job salaries is captured by the model, indicating a moderate level of predictive performance.
The scatter plot `fig` visualizes the relationship between the predicted salaries (`y_pred`) by the model and the actual salaries (`y_test`). Each point on the plot represents a data instance, comparing the predicted salary on the x-axis to the actual salary on the y-axis. This plot helps assess how well the model's predictions align with the true salary values. A diagonal alignment would indicate accurate predictions, while deviations from the diagonal line suggest discrepancies between predicted and actual values.
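To make the R² interpretation concrete, here is a small self-contained check (using made-up salary values, not numbers from this dataset) that the score reported by `model.score` equals 1 − SS_res/SS_tot, the same quantity `sklearn.metrics.r2_score` computes:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted salaries (in $K), chosen for illustration
y_true = np.array([80.0, 120.0, 95.0, 150.0])
y_pred = np.array([85.0, 110.0, 100.0, 140.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(round(manual_r2, 4), round(r2_score(y_true, y_pred), 4))  # 0.9113 0.9113
```

The same identity is what makes a score of 0.6294 interpretable: about 63% of the salary variance around the mean is explained by the model's predictions.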

Preview of the dataset
df2.head()
| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | Expert | Full-time | Principal Data Scientist | 80000 | EUR | 85847 | Spain | 100 | Spain | L |
| 1 | 2023 | Intermediate | Contract | ML Engineer | 30000 | USD | 30000 | United States | 100 | United States | S |
| 2 | 2023 | Intermediate | Contract | ML Engineer | 25500 | USD | 25500 | United States | 100 | United States | S |
| 3 | 2023 | Expert | Full-time | Data Scientist | 175000 | USD | 175000 | Canada | 100 | Canada | M |
| 4 | 2023 | Expert | Full-time | Data Scientist | 120000 | USD | 120000 | Canada | 100 | Canada | M |
# Extract the "job title" column
job_titles = df2['job_title']
# Calculate the frequency of each job title
title_counts = job_titles.value_counts()
# Extract the top 20 most frequent job titles
top_20_titles = title_counts.head(20)
# Create a DataFrame for the top 20 titles
top_20_df = pd.DataFrame({'Job Title': top_20_titles.index, 'Count': top_20_titles.values})
# Plotting the count plot
plt.figure(figsize=(12, 6))
sns.set(style="darkgrid")
ax = sns.barplot(data=top_20_df, x='Count', y='Job Title', hue='Job Title', palette='cubehelix', legend=False)
plt.xlabel('Count')
plt.ylabel('Job Titles')
plt.title('Top 20 Most Frequent Job Titles')
# Add count labels to the bars
for i, v in enumerate(top_20_df['Count']):
    ax.text(v + 0.2, i, str(v), color='black', va='center')
plt.tight_layout()
plt.show()
Inference 1: Among the 20 most common data jobs globally, data engineering emerges as the predominant role, followed sequentially by data scientist, data analyst, and machine learning engineer. This hierarchy sheds light on the key roles driving the data landscape, emphasizing the critical importance of data engineering among contemporary data-centric professions.
# Calculate the number of individuals in each experience level
level_counts = df2['experience_level'].value_counts()
# Create a pie chart
plt.figure(figsize=(7,12),dpi=80)
plt.pie(level_counts.values, labels=level_counts.index, autopct='%1.1f%%')
plt.title('Experience Level Distribution')
plt.show()
Inference 2: The distribution highlights a distinct pattern in experience levels, revealing a prevalence of experts, followed by intermediate professionals and, subsequently, junior roles.
# Create bar chart
average_salary = df2.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False)
top_ten_salaries = average_salary.head(10)
plt.figure(figsize=(15,10),dpi=80)
plt.bar(top_ten_salaries.index, top_ten_salaries)
# Add labels to the chart
plt.xlabel('Job')
plt.ylabel('Salary $')
plt.title('Average of the ten highest salaries by Job Titles')
plt.xticks(rotation=35, ha='right')
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()
Inference 3: The examination clearly indicates that Data Science Lead commands the highest global salaries within the data field, showcasing a substantial earnings gap over roles such as Data Analytics Lead, Data Engineer, Machine Learning Engineer, and Data Science Manager, all of which earn relatively similar, moderate salaries. This insight underscores the distinct salary hierarchies across key positions in the data domain.
# Create bar chart
average_salary = df2.groupby('company_location')['salary_in_usd'].mean().sort_values(ascending=False)
top_ten_countries = average_salary.head(10)
plt.figure(figsize=(15,10),dpi=80)
plt.bar(top_ten_countries.index, top_ten_countries)
# Add labels to the chart
plt.xlabel('Country')
plt.ylabel('Salary $')
plt.title('Average of the ten highest salaries by country')
plt.xticks(rotation=20, ha='right')
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()
Global Map Visualization
import geopandas as gpd
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
# Load the GeoJSON file for countries
world = gpd.read_file('countries.geo.json')
# Aggregate one mean salary per country, then merge with the GeoDataFrame
# so each country geometry is colored by a single value
country_salary = df2.groupby('company_location', as_index=False)['salary_in_usd'].mean()
merged = world.merge(country_salary, left_on='name', right_on='company_location')
# Plot the choropleth map
merged.plot(column='salary_in_usd', cmap='RdYlGn', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
# Customize the plot
ax.set_title('Salary Distribution by Country')
ax.set_axis_off()
# Show the plot
plt.show()
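A caveat worth noting about choropleth merges: if the salary frame holds many rows per country, merging it directly into the GeoDataFrame duplicates each country's geometry, and the map ends up coloured by an arbitrary row. A pandas-only sketch on toy data (no geopandas needed) illustrating why aggregating to one value per country first matters:

```python
import pandas as pd

# Toy stand-ins for the GeoDataFrame and the salary dataset
world_demo = pd.DataFrame({'name': ['Israel', 'India']})
df_demo = pd.DataFrame({'company_location': ['Israel', 'Israel', 'India'],
                        'salary_in_usd': [400_000, 100_000, 90_000]})

# Direct merge: each country appears once per salary record
naive = world_demo.merge(df_demo, left_on='name', right_on='company_location')
print(len(naive))  # 3 -- Israel's geometry would be drawn twice

# Aggregate first: exactly one (average) value per country
avg = df_demo.groupby('company_location', as_index=False)['salary_in_usd'].mean()
clean = world_demo.merge(avg, left_on='name', right_on='company_location')
print(len(clean))  # 2
print(clean.loc[clean['name'] == 'Israel', 'salary_in_usd'].iloc[0])  # 250000.0
```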
Inference 4: The findings suggest that, on a global scale, Israel stands out for offering significantly higher average salaries, denominated in USD, than other countries. This observation underscores a notable disparity in compensation rates between Israel and the rest of the world.
# Filter out cases where company location is different from employee residence
foreign_employees = df2[df2['company_location'] != df2['employee_residence']]
# Count the occurrences of employee residence countries
count = foreign_employees['employee_residence'].value_counts().head(10)
# Create a bar chart
plt.figure(figsize=(15, 10), dpi=80)
plt.bar(count.index, count)
# Add labels to the chart
plt.xlabel('Employee Residence Country')
plt.ylabel('Number of Employees')
plt.title('Employee Residence Countries Most Hired by Companies Based Abroad')
plt.xticks(rotation=20, ha='right')
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()
Inference 5: Evidently, India accounts for the largest share of employees hired by companies based abroad, indicating a significant trend in cross-border hiring. This pattern underscores the strong global demand for Indian professionals across sectors.
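The cell above pairs a row-wise inequality filter with `value_counts`; a minimal illustration of that pattern on a hypothetical three-row sample (country codes invented for the example):

```python
import pandas as pd

# Hypothetical sample: two employees work for companies outside their residence
df_demo = pd.DataFrame({'company_location':   ['US', 'US', 'DE'],
                        'employee_residence': ['IN', 'US', 'IN']})

# Keep only cross-border rows (residence differs from company location)
foreign = df_demo[df_demo['company_location'] != df_demo['employee_residence']]

# Count which residence countries are hired from abroad most often
counts = foreign['employee_residence'].value_counts()
print(counts.to_dict())  # {'IN': 2}
```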
data_filtered = df2[df2['remote_ratio'] == 100]
# Get the top 10 job titles with 100% remote ratio
top_10_job_titles = data_filtered['job_title'].value_counts().nlargest(10).index
# Filter the data for the top 10 job titles
data_filtered_top_10 = data_filtered[data_filtered['job_title'].isin(top_10_job_titles)]
# Create a horizontal bar chart with color and labels
fig = ex.bar(data_filtered_top_10, x='salary_in_usd', y='job_title', orientation='h',
color='experience_level', labels={'experience_level': 'Experience','job_title': 'Job Title','salary_in_usd':'Salary'})
# Customize layout
fig.update_layout(
title='Top 10 Job Titles with 100% Remote Ratio: Salary vs Experience Level',
xaxis_title='',
yaxis_title='Job Title',
coloraxis_colorbar=dict(title='Experience Level'),
)
# Show the figure
fig.show()
Inference 6: The analysis underscores that high-paying remote jobs are primarily secured by experts, with moderate representation for intermediate professionals, while opportunities for junior-level candidates to land such lucrative remote roles appear comparatively limited. This insight highlights a correlation between experience level and the likelihood of obtaining well-compensated remote positions.
# Calculate the average remote ratio for each year
average_remote_ratio = df2.groupby('work_year')['remote_ratio'].mean().reset_index()
# Create a line plot for the average remote ratio
fig = ex.line(average_remote_ratio, x='work_year', y='remote_ratio', markers=True, title='Average Remote Ratio by Year')
# Customize layout
fig.update_layout(
xaxis_title='Year',
yaxis_title='Average Remote Ratio',
)
# Show the figure
fig.show()
Inference 7: The graph indicates that remote work peaked in 2021 during the COVID-19 pandemic, with a discernible slowdown in the prevalence of remote jobs since. This suggests a shift in the remote work landscape, reflecting changing workplace dynamics post-pandemic.
# Filter data for the specified years (2020-2023) and job title 'Data Scientist'
data_filtered = df2[(df2['work_year'].isin([2020, 2021, 2022, 2023])) & (df2['job_title'] == 'Data Scientist')]
# Create a scatter plot
fig = ex.scatter(data_filtered, x='work_year', y='salary_in_usd', color='salary_in_usd', size='salary_in_usd')
# Customize layout
fig.update_layout(
title='Year-wise Salary Scatter Plot for Data Scientists (2020-2023)',
xaxis_title='Year',
yaxis_title='Salary',
)
# Show the figure
fig.show()
Inference 8: The scatter plot reveals a consistent upward trajectory in data scientist salaries from 2020 to 2023, indicating a steady and positive compensation trend over the years. This suggests a favorable market for data science professionals, with increasing recognition and value assigned to their skill set.
Key Findings:
- Job Demand Disparity
- Salary Discrepancy
- Remote Job Expertise
- Global Technology Trends
- Global Salary Disparities
- International Workforce Dynamics
- Post-Pandemic Remote Work Landscape

Future Work:
- Enhancement of Salary Prediction Model
- Global Salary Prediction Model
- Cross-Industry Exploration
- International Salary Comparisons
- Diversification Beyond Data
- Exclusive Analysis of Remote Work Data

*Your salary is the bribe they give you to forget your dreams* - "Embrace the journey of chasing your dreams, for in the pursuit of passion, success follows. Your salary may be a temporary reward, but the real treasure lies in the fulfillment of your aspirations. Don't merely run towards a paycheck; sprint towards your dreams, and watch as the currency of passion and determination transforms into the wealth of a purposeful and rewarding life."
Thank You
Arpita Halder
Master's in Artificial Intelligence
Matrikel Nr. 22974970